- 
                Notifications
    You must be signed in to change notification settings 
- Fork 477
Add exponential backoff retry logic to Netlify builds #20409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add test-remote-failure.md with multiple failure scenarios - Configure netlify.toml for retry testing environment - Keep cache plugin for performance during testing - Ready for manual PR creation to trigger deploy preview
- Create comprehensive build test script with retry capabilities - Add detailed logging and build statistics - Support both retry testing and stress testing modes - Include build timing and attempt tracking - Make script executable for Netlify deployment
- Remove test-remote-failure.md with intentional failures - Update netlify.toml for production retry configuration - Rename build-test.sh to build.sh and remove test-specific code - Configure 3 retries with 30s base exponential backoff delay - Simplify logging for production use while keeping retry functionality
| ✅ Deploy Preview for cockroachdb-interactivetutorials-docs canceled.
 | 
| ✅ Deploy Preview for cockroachdb-api-docs canceled.
 | 
| Files changed:
 | 
| ✅ Netlify Preview
 To edit notification comments on pull requests, go to your Netlify project configuration. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ebembi-crdb I like the logging and exponential backoff. I tested the PR with some common error types (via a few commits on a branch off your branch) and found the following:
- Good behavior: htmltest failures (bad anchor links) do not retry - this is correct since these are permanent errors (commit/log)
- Problematic behavior:
Both of these are permanent errors that won't resolve on retry.
I would suggest adding error classification to only retry transient network errors (and in the future, we could always add additional possible criteria).
Specifically, per Claude, I believe we could modify the build_with_monitoring function to capture and analyze Jekyll's output:
function build_with_monitoring {
    local config=$1
    local build_log="build_${ATTEMPT_COUNT}.log"
    # Capture Jekyll output for analysis
    if bundle exec jekyll build --trace --config _config_base.yml,$config 2>&1 | tee "$build_log"; then
        return 0
    else
        local exit_code=$?
        # Only retry on transient network errors
        if grep -qE "Temporary failure in name resolution|SocketError|Connection refused|Connection reset|Failed to open TCP connection" "$build_log"; then
            return 2  # Retryable error
        else
            # Permanent errors: Liquid syntax, missing files, ArgumentError
            echo "❌ Permanent build error detected - not retrying"
            return 1  # Non-retryable
        fi
    fi
}Then update build_with_retries to check the return code:
if build_with_monitoring "$config"; then
    success=true
    break
else
    local result=$?
    if [[ $result -eq 1 ]]; then
        echo "Permanent error - failing immediately"
        break  # Don't retry
    fi
    # Only retry if result == 2 (transient error)
fiI have not tested this, but I wanted to share it in case it is helpful.
One way or another I do think it is worth only retrying on DNS/network failures, since I can't think of any other failure that it would've benefitted us to retry, and I know that retrying on those other cases I mentioned would waste time for writers when they aren't closely watching their build log.
Implement smart error classification to only retry transient network errors, preventing unnecessary retries on permanent build failures that waste 4-5 minutes. Changes: - Capture Jekyll build output to log files for error analysis - Classify errors into retryable (network/DNS/SSL) vs permanent (Liquid syntax, missing files) - Return different exit codes: 0=success, 1=permanent error, 2=transient error - Update retry logic to immediately fail on permanent errors - Only retry on actual network connectivity issues Benefits: - Eliminates 4-5 minute delays from retrying Liquid syntax errors - Eliminates delays from retrying missing file reference errors - Provides clear feedback explaining retry decisions - Conservative approach: unclassified errors default to non-retryable Addresses PR review feedback from #20409 to improve build experience for writers encountering permanent errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I think this will work well, and we can always tweak the grep statements later if needed, of course. Thanks again for working on this.
Summary
Implements robust retry logic with exponential backoff for Netlify documentation builds to handle transient network
failures and remote include errors.
Changes
Benefits
Configuration
MAX_RETRIES: Number of retry attempts (default: 3)BASE_RETRY_DELAY: Base delay in seconds for exponential backoff (default: 30)Testing
Validated with intentional failure scenarios including:
The retry mechanism successfully recovers from transient failures while failing fast on persistent issues.